
[Intel MKL] Use Shard function instead of Eigen device to parallelize Adam kernel. #26424

Conversation

@Zantares (Contributor) commented Mar 7, 2019

This could reduce memory accesses and improve cache locality on CPU.

modified:

  • tensorflow/core/kernels/training_ops.cc
  • tensorflow/core/kernels/training_ops.h
  • tensorflow/core/kernels/training_ops_gpu.cu.cc

Signed-off-by: Lu Teng teng.lu@intel.com

@Zantares (Contributor, Author) commented Mar 7, 2019

An Eigen device expression can only update one variable at a time, but Adam needs to update three variables, so it used three expressions, which hurts CPU cache locality. This change uses the Shard function in place of the Eigen device expressions.

This patch was tested on the NCF model from the MLPerf 0.5 submission. It speeds up the Adam kernel by 15%~30% and improves overall model performance by 10%~20%.
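
For illustration, here is a minimal standalone sketch of the fused shard body (the function name and signature are hypothetical, not the exact PR code): all three Adam state updates happen in a single pass, so each element of grad, m, v, and var is touched once per step instead of once per Eigen expression.

#include <cmath>
#include <cstdint>

// Hypothetical shard body: the sharding helper calls it with disjoint
// [begin, end) ranges on different threads. alpha is the bias-corrected
// learning rate lr * sqrt(1 - beta2^t) / (1 - beta1^t).
void AdamShard(int64_t begin, int64_t end, float alpha, float beta1,
               float beta2, float epsilon, const float* grad, float* m,
               float* v, float* var) {
  for (int64_t i = begin; i < end; ++i) {
    // One pass updates m, v, and var while the element is hot in cache.
    m[i] += (grad[i] - m[i]) * (1.0f - beta1);
    v[i] += (grad[i] * grad[i] - v[i]) * (1.0f - beta2);
    var[i] -= (m[i] * alpha) / (std::sqrt(v[i]) + epsilon);
  }
}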

@Zantares Zantares changed the title Use Shard function instead of Eigen device to parallelize Adam kernel. [Intel MKL]Use Shard function instead of Eigen device to parallelize Adam kernel. Mar 7, 2019
@Zantares Zantares changed the title [Intel MKL]Use Shard function instead of Eigen device to parallelize Adam kernel. [Intel MKL] Use Shard function instead of Eigen device to parallelize Adam kernel. Mar 7, 2019
@rthadur rthadur requested a review from yifeif March 7, 2019 20:58
@rthadur rthadur added awaiting review Pull request awaiting review size:M CL Change Size: Medium labels Mar 7, 2019
@rthadur rthadur added this to Assigned Reviewer in PR Queue via automation Mar 7, 2019
@rthadur rthadur self-assigned this Mar 7, 2019
@agramesh1 (Contributor) commented

pinging @ezhulenev for a review. Thanks.

@ezhulenev (Member) left a comment

Could you please also add a benchmark similar to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc? I'll need it to run performance testing internally.

tensorflow/core/kernels/training_ops.cc (review thread, resolved)
tensorflow/core/kernels/training_ops.cc (review thread, outdated, resolved)
@tensorflowbutler tensorflowbutler removed the awaiting review Pull request awaiting review label Mar 12, 2019
To get better cache locality, use Shard instead of Eigen expression.
Also added a benchmark to test Adam performance.
@Zantares (Contributor, Author) commented

> Could you please also add a benchmark similar to https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc? I'll need it to run performance testing internally.

Hi @ezhulenev, I've refined the code and added a benchmark: https://github.com/tensorflow/tensorflow/pull/26424/files#diff-0b9bd0c5daec98f25d2e15c9b8c0370cR200.

My test results are:

original:
Benchmark Time(ns) Iterations Throughput Items/s

BM_SGD/131072 58816 10000 8914.0MB/s 2228.5M items/s
BM_SGD/262144 107837 6335 9723.7MB/s 2430.9M items/s
BM_Adagrad/131072 136092 5126 3852.5MB/s 963.1M items/s
BM_Adagrad/262144 216470 3093 4844.0MB/s 1211.0M items/s
BM_Momentum/131072 126981 5114 4128.9MB/s 1032.2M items/s
BM_Momentum/262144 206378 3434 5080.9MB/s 1270.2M items/s
BM_Adam/131072/0 194416 3452 2696.7MB/s 674.2M items/s
BM_Adam/262144/0 334504 2110 3134.7MB/s 783.7M items/s
BM_Adam/16777216/1 9864090 100 6803.4MB/s 1700.8M items/s
BM_RMSProp/131072 187562 3545 2795.3MB/s 698.8M items/s
BM_RMSProp/262144 334770 2180 3132.2MB/s 783.1M items/s
BM_AddSign/131072 512574 1449 1022.9MB/s 255.7M items/s
BM_AddSign/262144 922186 727 1137.1MB/s 284.3M items/s
BM_PowerSign/131072 2179936 311 240.5MB/s 60.1M items/s
BM_PowerSign/262144 3963514 177 264.6MB/s 66.1M items/s

optimized:
Benchmark Time(ns) Iterations Throughput Items/s

BM_SGD/131072 69680 9636 7524.2MB/s 1881.0M items/s
BM_SGD/262144 95855 5395 10939.2MB/s 2734.8M items/s
BM_Adagrad/131072 158376 5181 3310.4MB/s 827.6M items/s
BM_Adagrad/262144 234968 2831 4462.6MB/s 1115.7M items/s
BM_Momentum/131072 118026 5495 4442.1MB/s 1110.5M items/s
BM_Momentum/262144 215430 3169 4867.4MB/s 1216.8M items/s
BM_Adam/131072/0 202429 3486 2590.0MB/s 647.5M items/s
BM_Adam/262144/0 328765 1965 3189.4MB/s 797.4M items/s
BM_Adam/16777216/1 7393820 100 9076.3MB/s 2269.1M items/s
BM_RMSProp/131072 187683 3003 2793.5MB/s 698.4M items/s
BM_RMSProp/262144 372268 2006 2816.7MB/s 704.2M items/s
BM_AddSign/131072 648737 1000 808.2MB/s 202.0M items/s
BM_AddSign/262144 876228 764 1196.7MB/s 299.2M items/s
BM_PowerSign/131072 2178264 322 240.7MB/s 60.2M items/s
BM_PowerSign/262144 3908483 174 268.3MB/s 67.1M items/s


There may be some run-to-run variance between executions, but you can see that the optimized version generally outperforms the original.

@Zantares (Contributor, Author) commented

BTW, the Tensor vectorization form has similar performance to the loop version. I guess Eigen may generate the same assembly internally, but I didn't dig too deep.
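
For context, a hedged sketch of the two forms being compared (using Eigen's Array Map as a stand-in; the function names are illustrative, not the PR code):

#include <cstdint>
#include <Eigen/Core>

// (a) Expression form over a mapped slice of the shard.
void UpdateMExpr(float* m, const float* g, int64_t begin, int64_t end,
                 float beta1) {
  Eigen::Map<Eigen::ArrayXf> m_s(m + begin, end - begin);
  Eigen::Map<const Eigen::ArrayXf> g_s(g + begin, end - begin);
  m_s += (g_s - m_s) * (1.0f - beta1);
}

// (b) Plain scalar loop; per the comment above it benchmarks about the
// same, presumably because the compiler auto-vectorizes it after inlining.
void UpdateMLoop(float* m, const float* g, int64_t begin, int64_t end,
                 float beta1) {
  for (int64_t i = begin; i < end; ++i) {
    m[i] += (g[i] - m[i]) * (1.0f - beta1);
  }
}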

tensorflow/core/kernels/training_ops.cc (review thread, outdated, resolved)
Code under review (excerpt from tensorflow/core/kernels/training_ops.cc):

  length = length / size;
} else {
  size = 1;
}
@ezhulenev (Member) commented:

There is no need to divide the input size by the packet size and do "manual vectorization". If it's desirable to have the shard size (end - begin) be a multiple of the packet size, you can pass block_align to parallelFor (see https://bitbucket.org/eigen/eigen/src/4b28c8008901c6d760f48f26ee2e3423fd8a2b40/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h#lines-185).

I think this should work:

[packet_size](Index index) -> Index { return Eigen::divup(index, packet_size); }

@Zantares (Contributor, Author) replied:

I ran into some questions when trying to use this function; please see my comment below.

tensorflow/core/kernels/training_ops_test.cc (review thread, outdated, resolved)
PR Queue automation moved this from Assigned Reviewer to Reviewer Requested Changes Mar 14, 2019
@ezhulenev (Member) commented

I guess that after inlining it all might have been fused into a single loop by the compiler. Anyway, it's great that there is no performance difference and we can keep the simpler code.

@Zantares (Contributor, Author) commented Mar 15, 2019

When I tried to use block_align to align the shard size, I found that performance decreased on a real model, so I captured the parameter sizes from the model and made a small benchmark: e4dae32#diff-0b9bd0c5daec98f25d2e15c9b8c0370cR200.

env: Intel Xeon Skylake-8180, 56 cores
cmd: numactl -N 0 -l bazel run --config=mkl --copt=-mavx2 --copt=-mfma --copt=-march=broadwell --copt=-O2 --copt=-L$HOME/code/1/gcc6/gcc6.3/lib64/ -- //tensorflow/core/kernels:training_ops_test -- --benchmarks=..

current implementation:

BM_Adam/8192/1 82133 8295 399.0MB/s 99.7M items/s
BM_Adam/16777216/1 8501990 100 7893.3MB/s 1973.3M items/s

with block_align:

BM_Adam/8192/1 88312 7707 371.0MB/s 92.8M items/s
BM_Adam/16777216/1 8462090 100 7930.5MB/s 1982.6M items/s


With "manual vectorization", the small benchmark will get better performance(+10%). It's really confused me, maybe Eigen efficiency model https://bitbucket.org/eigen/eigen/src/4b28c8008901c6d760f48f26ee2e3423fd8a2b40/unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h?fileviewer=file-view-default#TensorDeviceThreadPool.h-188 can't handle small size with block_align well?


How I use block_align

+    // Set a function to align block size to packet size, which can get more
+    // chance to vectorize.
+    auto block_align = [packet_size](Index block_size) -> Index {
+      return Eigen::divup(block_size, packet_size) * packet_size;
+    };
+    d.parallelFor(length, cost, block_align, shard);

block_align receives the block_size computed by the Eigen efficiency model; it allows us to round the size up with our own rule and return it as the new block size. I must increase the block size in block_align, or it will get trapped in an infinite loop.
Based on the results, I prefer the "manual vectorization" version. What do you think about this situation?

@ezhulenev (Member) commented

That's strange. I'll try to reproduce it internally after it's merged.

@ezhulenev previously approved these changes Mar 15, 2019
PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 15, 2019
@ezhulenev (Member) commented

I think the problem is an incorrectly computed cost, causing Eigen to shard too much or too little.

PR Queue automation moved this from Approved by Reviewer to Reviewer Requested Changes Mar 19, 2019
@Zantares (Contributor, Author) commented Mar 19, 2019

> I think the problem is an incorrectly computed cost, causing Eigen to shard too much or too little.

Hi @ezhulenev, please take a look at the new commit. I fixed an error in the cost computation: compute_cycles needs to be multiplied by the length. I also checked the CostModel in Eigen; it already has some estimation of cache behavior, so I added the store cost. This should be the last commit if there are no more review suggestions.

The "manual vectorization" is still better than block_align, I guess because the compiler can get more static information from "manual vectorization", while block_align may generate a tail that can't be divided evenly at run time.
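
For reference, a hedged sketch of how a per-unit cost is handed to Eigen's parallelFor (the byte and cycle constants are illustrative, not the exact values in the PR; Eigen's cost model uses them to pick the shard size):

#define EIGEN_USE_THREADS
#include <functional>
#include <unsupported/Eigen/CXX11/Tensor>

// Hypothetical wrapper: 'shard' is the fused Adam update over [begin, end).
void RunSharded(const Eigen::ThreadPoolDevice& d, Eigen::Index length,
                std::function<void(Eigen::Index, Eigen::Index)> shard) {
  // Per-unit cost: reads of grad/m/v/var, writes of m/v/var, plus an
  // estimated cycle count that includes the stores discussed above.
  const double bytes_loaded = 4 * sizeof(float);
  const double bytes_stored = 3 * sizeof(float);
  const double compute_cycles = 5 * Eigen::TensorOpCost::AddCost<float>() +
                                2 * Eigen::TensorOpCost::MulCost<float>();
  const Eigen::TensorOpCost cost(bytes_loaded, bytes_stored, compute_cycles);
  d.parallelFor(length, cost, std::move(shard));
}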

PR Queue automation moved this from Reviewer Requested Changes to Approved by Reviewer Mar 19, 2019
@rthadur rthadur added kokoro:force-run Tests on submitted change ready to pull PR ready for merge process labels Mar 19, 2019
@kokoro-team kokoro-team removed the kokoro:force-run Tests on submitted change label Mar 19, 2019
@tensorflow-copybara tensorflow-copybara merged commit 2160c84 into tensorflow:master Mar 20, 2019
PR Queue automation moved this from Approved by Reviewer to Merged Mar 20, 2019
tensorflow-copybara pushed a commit that referenced this pull request Mar 20, 2019
@Zantares Zantares deleted the Intel-TF/tenglu/fuse_adam branch March 21, 2019 00:48
Labels: cla: yes, ready to pull, size:M
Projects: PR Queue (Merged)
8 participants